A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation
We propose a generalized focal loss function based on the Tversky index to
address the issue of data imbalance in medical image segmentation. Compared to
the commonly used Dice loss, our loss function achieves a better trade-off
between precision and recall when training on small structures such as lesions.
To evaluate our loss function, we improve the attention U-Net model by
incorporating an image pyramid to preserve contextual features. We experiment
on the BUS 2017 dataset and ISIC 2018 dataset where lesions occupy 4.84% and
21.4% of the image area, and improve segmentation accuracy over the standard U-Net by 25.7% and 3.6%, respectively.

Comment: Submitted to the 2019 IEEE International Symposium on Biomedical Imaging (ISBI).
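As a concrete illustration of the loss described above, here is a minimal NumPy sketch of a focal Tversky-style loss for binary segmentation; the alpha, beta, and gamma values are common illustrative choices, not necessarily the paper's tuned settings.

    import numpy as np

    def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
        # Tversky index: TP / (TP + alpha*FN + beta*FP); alpha > beta penalizes
        # false negatives more heavily, favoring recall on small lesions.
        y_true, y_pred = y_true.ravel(), y_pred.ravel()
        tp = np.sum(y_true * y_pred)
        fn = np.sum(y_true * (1.0 - y_pred))
        fp = np.sum((1.0 - y_true) * y_pred)
        tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
        # A focal exponent below 1 up-weights hard examples with a low index.
        return (1.0 - tversky) ** gamma

With alpha = beta = 0.5 and gamma = 1 this reduces to the familiar Dice loss, which makes the precision-recall trade-off explicit.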
A Novel Image-centric Approach Towards Direct Volume Rendering
Transfer Function (TF) generation is a fundamental problem in Direct Volume
Rendering (DVR). A TF maps voxels to color and opacity values to reveal inner
structures. Existing TF tools are complex and unintuitive for users, who are more likely to be medical professionals than computer scientists. In this
paper, we propose a novel image-centric method for TF generation where instead
of complex tools, the user directly manipulates volume data to generate DVR.
The user's work is further simplified by presenting only the most informative
volume slices for selection. Based on the selected parts, the voxels are
classified using our novel Sparse Nonparametric Support Vector Machine
classifier, which combines both local and near-global distributional
information of the training data. The voxel classes are mapped to aesthetically
pleasing and distinguishable color and opacity values using harmonic colors.
Experimental results on several benchmark datasets and a detailed user survey
show the effectiveness of the proposed method.

Comment: To appear in ACM Transactions on Intelligent Systems and Technology.
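For the final mapping step, the sketch below shows one plausible way to turn voxel class labels into RGBA values, using evenly spaced hues as a rough stand-in for the paper's harmonic color selection; the voxel_classes array is assumed to come from the upstream classifier.

    import colorsys
    import numpy as np

    def classes_to_rgba(voxel_classes, n_classes, base_hue=0.6):
        # One distinguishable hue per class, spaced evenly around the wheel.
        lut = np.zeros((n_classes, 4))
        for c in range(n_classes):
            hue = (base_hue + c / n_classes) % 1.0
            lut[c, :3] = colorsys.hsv_to_rgb(hue, 0.8, 0.9)
            # Opacity ramps with class index so inner structures stay visible.
            lut[c, 3] = 0.05 + 0.85 * c / max(n_classes - 1, 1)
        return lut[voxel_classes]  # one RGBA value per voxel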
Multi-level Stress Assessment Using Multi-domain Fusion of ECG Signal
Stress analysis and assessment of affective states of mind using ECG as a physiological signal is an active research topic in biomedical signal processing. However, the existing literature provides only a binary assessment of stress, while multiple levels of assessment may be more beneficial for healthcare applications. Furthermore, in existing research, the ECG signal for stress analysis is examined independently in the spatial domain or in transform domains, but the advantage of fusing these domains has not been fully exploited.
To take maximum advantage of fusing different domains, we introduce a dataset
with multiple stress levels and then classify these levels using a novel deep
learning approach by converting the ECG signal into signal images based on R-R peaks, without any manual feature extraction. Moreover, we make the signal images multimodal and multidomain by converting them into the time-frequency and frequency domains using the Gabor wavelet transform (GWT) and the Discrete Fourier Transform (DFT), respectively. Convolutional Neural Networks (CNNs) are used to extract features from the different modalities, and decision-level fusion is then performed to improve classification accuracy. Experimental results on an in-house dataset collected from 15 users show that, with the proposed fusion framework and ECG signal-to-image conversion, we reach an average accuracy of 85.45%.
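The signal-image construction can be sketched roughly as follows; the window length is illustrative, R-peak detection is assumed to happen upstream, and the GWT branch would be generated analogously to the DFT branch shown here.

    import numpy as np

    def ecg_to_signal_image(ecg, r_peaks, win=96):
        # Stack fixed-length windows centered on detected R peaks into a
        # 2-D "signal image", so no hand-crafted features are needed.
        half = win // 2
        rows = [ecg[p - half:p + half]
                for p in r_peaks if half <= p <= len(ecg) - half]
        return np.stack(rows)  # shape: (n_beats, win)

    def frequency_modality(signal_image):
        # Second domain: log-magnitude of the 2-D DFT of the signal image.
        return np.log1p(np.abs(np.fft.fft2(signal_image)))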
Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors
One of the major reasons for the misclassification of multiplex actions in action recognition is the unavailability of complementary features that provide semantic information about the actions. These features are present in different domains with different scales and intensities. In the existing literature, features are extracted independently in different domains, but the benefits of fusing these multidomain features have not been realized. To address this challenge, and to extract a complete set of complementary information, in this paper we propose a novel multidomain multimodal fusion framework that extracts
complementary and distinct features from different domains of the input
modality. We transform the input inertial data into signal images, and then make the input modality multidomain and multimodal by transforming the spatial domain information into the frequency and time-spectrum domains using the Discrete Fourier Transform (DFT) and Gabor wavelet transform (GWT), respectively. Features in the different domains are extracted by Convolutional Neural Networks (CNNs) and then fused by Canonical Correlation based Fusion (CCF) to improve the
accuracy of human action recognition. Experimental results on three inertial
datasets show the superiority of the proposed method when compared to the
state-of-the-art.
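One common realization of canonical-correlation-based fusion projects the two feature sets into a shared, maximally correlated subspace before combining them; below is a hedged scikit-learn sketch (the component count is illustrative, and the paper's exact CCF formulation may differ).

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def ccf_fuse(feats_a, feats_b, n_components=50):
        # feats_a, feats_b: (n_samples, d_a) and (n_samples, d_b) CNN features
        # extracted from two domains of the same input samples.
        cca = CCA(n_components=n_components)
        za, zb = cca.fit_transform(feats_a, feats_b)
        # Summing the projected views is one standard CCA-fusion variant.
        return za + zb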
Machine Learning on Biomedical Images: Interactive Learning, Transfer Learning, Class Imbalance, and Beyond
In this paper, we highlight three issues that limit the performance of machine learning on biomedical images, and tackle them through three case studies: 1) Interactive Machine Learning (IML): we show how IML can drastically improve exploration time and the quality of direct volume rendering; 2) transfer learning: we show how transfer learning, along with intelligent pre-processing, can result in better Alzheimer's diagnosis using a much smaller training set; 3) data imbalance: we show how our novel focal Tversky loss function can provide better segmentation results by taking into account the imbalanced nature of segmentation datasets. The case studies are accompanied by an in-depth analytical discussion of the results, with possible future directions.

Comment: Accepted at IEEE MIPR 2019. arXiv admin note: text overlap with arXiv:1810.0784
Motion Vector Extrapolation for Video Object Detection
Despite the continued successes of computationally efficient deep neural
network architectures for video object detection, performance continually
arrives at the great trilemma of speed versus accuracy versus computational
resources (pick two). Current attempts to exploit temporal information in video
data to overcome this trilemma are bottlenecked by the state-of-the-art in
object detection models. We present MOVEX, a technique that performs video object detection using off-the-shelf object detectors alongside existing optical-flow-based motion estimation techniques running in parallel. Through a set of
experiments on the benchmark MOT20 dataset, we demonstrate that our approach
significantly reduces the baseline latency of any given object detector without
sacrificing any accuracy. Further latency reduction, up to 25x lower than the
original latency, can be achieved with minimal accuracy loss. MOVEX enables low
latency video object detection on common CPU-based systems, thus allowing for high-performance video object detection beyond the domain of GPU computing. The
code is available at https://github.com/juliantrue/movex
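As a simplified stand-in for the idea (not the MOVEX implementation itself), detections from a keyframe can be propagated to later frames by translating each box with the mean optical flow inside it, so the expensive detector only has to run occasionally.

    import cv2
    import numpy as np

    def propagate_boxes(prev_gray, curr_gray, boxes):
        # Dense Farneback flow between two consecutive grayscale frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        moved = []
        for x1, y1, x2, y2 in boxes:  # boxes in integer pixel coordinates
            # Shift each detection by the average motion inside its region.
            dx, dy = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
            moved.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
        return moved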
Facial Expression Recognition Under Partial Occlusion from Virtual Reality Headsets based on Transfer Learning
Facial expressions of emotion are a major channel in our daily communications, and they have been the subject of intense research in recent years. To automatically infer facial expressions, convolutional neural network based approaches have become widely adopted due to their proven applicability to the Facial Expression Recognition (FER) task. On the other hand, Virtual Reality (VR)
has gained popularity as an immersive multimedia platform, where FER can
provide enriched media experiences. However, recognizing facial expressions while wearing a head-mounted VR headset is a challenging task, as the upper half of the face is completely occluded. In this paper, we attempt to overcome these issues and focus on facial expression recognition in the presence of severe occlusion, where the user is wearing a head-mounted display in a VR setting. We propose a geometric model to simulate occlusion resulting from a
Samsung Gear VR headset that can be applied to existing FER datasets. Then, we
adopt a transfer learning approach, starting from two pretrained networks,
namely VGG and ResNet. We further fine-tune the networks on FER+ and RAF-DB
datasets. Experimental results show that our approach achieves comparable
results to existing methods while training on three modified benchmark datasets
that adhere to realistic occlusion resulting from wearing a commodity VR
headset. Code for this paper is available at:
https://github.com/bita-github/MRP-FER

Comment: To be presented at the IEEE BigMM 2020.
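A hedged sketch of the two ingredients: a crude rectangular mask standing in for the paper's geometric Gear VR occlusion model (the band limits here are illustrative, not the actual headset geometry), and a torchvision-style transfer-learning setup in which only the classifier head is retrained.

    import torch.nn as nn
    from torchvision import models

    def simulate_headset_occlusion(img, top=0.15, bottom=0.55):
        # img: (C, H, W) image tensor; zero out the upper-face band that a
        # head-mounted display would cover (illustrative band limits).
        h = img.shape[1]
        img = img.clone()
        img[:, int(top * h):int(bottom * h), :] = 0.0
        return img

    # Transfer learning: reuse ImageNet weights, retrain only the new head.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 8)  # 8 FER+ emotion classes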
Human Action Recognition Using Deep Multilevel Multimodal (M2) Fusion of Depth and Inertial Sensors
Multimodal fusion frameworks for Human Action Recognition (HAR) using depth
and inertial sensor data have been proposed over the years. In most of the
existing works, fusion is performed at a single level (feature level or
decision level), missing the opportunity to fuse rich mid-level features
necessary for better classification. To address this shortcoming, in this
paper, we propose three novel deep multilevel multimodal fusion frameworks to
capitalize on different fusion strategies at various stages and to leverage the
superiority of multilevel fusion. At the input, we transform the depth data into depth images called sequential front view images (SFIs) and the inertial sensor data into signal images. Each input modality, depth and inertial, is further made multimodal by convolving it with the Prewitt filter. Creating
"modality within modality" enables further complementary and discriminative
feature extraction through Convolutional Neural Networks (CNNs). CNNs are
trained on input images of each modality to learn low-level, high-level and
complex features. Learned features are extracted and fused at different stages
of the proposed frameworks to combine discriminative and complementary
information. These highly informative features serve as input to a
multi-class Support Vector Machine (SVM). We evaluate the proposed frameworks
on three publicly available multimodal HAR datasets, namely, UTD Multimodal
Human Action Dataset (MHAD), Berkeley MHAD, and UTD-MHAD Kinect V2.
Experimental results show the superiority of the proposed fusion frameworks over existing methods.

Comment: 10 pages, 13 figures.
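The "modality within modality" step can be sketched as below: a Prewitt gradient image derived from each input image gives every CNN stream a second, edge-oriented view of the same data (a minimal SciPy version, not the authors' exact pipeline).

    import numpy as np
    from scipy.ndimage import prewitt

    def modality_within_modality(img):
        # Convolve with the Prewitt operator along each axis and combine into
        # a gradient-magnitude image; a CNN is then trained on the original
        # and the filtered image as separate modalities.
        img = img.astype(float)
        return np.hypot(prewitt(img, axis=0), prewitt(img, axis=1))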
Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data
This paper aims to improve the accuracy of Human Action Recognition (HAR) by fusing depth and inertial sensor data. First, we transform the depth data into Sequential Front view Images (SFI) and fine-tune the pre-trained AlexNet on these images. Then, the inertial data is converted into Signal Images (SI) and another convolutional neural network (CNN) is trained on these images. Finally, learned features are extracted from both CNNs and fused into a shared feature layer, which is fed to the classifier. We
experiment with two classifiers, namely a Support Vector Machine (SVM) and a softmax classifier, and compare their performances. The recognition accuracies of each individual modality, depth data alone and inertial data alone, are also calculated and compared with the fusion-based accuracies to highlight that fusing modalities yields better results than individual modalities. Experimental results on the UTD-MHAD and Kinect V2 datasets show that the proposed method achieves state-of-the-art results when compared to other recently proposed visual-inertial action recognition methods.
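A minimal sketch of this feature-level fusion, assuming the penultimate-layer CNN activations have already been extracted for each modality (names and hyperparameters are placeholders):

    import numpy as np
    from sklearn.svm import SVC

    def train_fused_svm(depth_feats, inertial_feats, labels):
        # Concatenate per-modality CNN features into a shared feature layer,
        # then train a multi-class SVM on the fused representation.
        fused = np.concatenate([depth_feats, inertial_feats], axis=1)
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(fused, labels)
        return clf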
Deep Clustering with a Dynamic Autoencoder: From Reconstruction towards Centroids Construction
In unsupervised learning, there is no apparent straightforward cost function
that can capture the significant factors of variation and similarity. Since
natural systems have smooth dynamics, an opportunity is lost if an unsupervised
objective function remains static during the training process. The absence of
concrete supervision suggests that smooth dynamics should be integrated.
Compared to classical static cost functions, dynamic objective functions allow better use of the gradual and uncertain knowledge acquired through
pseudo-supervision. In this paper, we propose Dynamic Autoencoder (DynAE), a
novel model for deep clustering that overcomes the clustering-reconstruction trade-off by gradually and smoothly eliminating the reconstruction objective function in favor of a construction one. Experimental evaluations on benchmark
datasets show that our approach achieves state-of-the-art results compared to
the most relevant deep clustering methods.
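The core idea can be sketched as a convex blend of the two objectives whose weight is annealed during training; this is an illustrative reading of a dynamic objective, not DynAE's exact formulation.

    import torch

    def dynamic_loss(x, x_recon, z, centroids, assignments, alpha):
        # alpha is annealed from 1 toward 0 over training, smoothly replacing
        # the reconstruction term with a centroid-construction term.
        reconstruction = torch.mean((x - x_recon) ** 2)
        construction = torch.mean((z - centroids[assignments]) ** 2)
        return alpha * reconstruction + (1.0 - alpha) * construction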